Motivation: the curse of dimensionality

Multivariate data is not just a challenge for visualization.

You have too many variables

  • Some are redundant in measuring essentially the same thing
  • Some are highly correlated with others (multicollinear)
  • Some are both

Your number of variables approaches your number of observations.

What is a composite variable?

A linear combination of variables

Multiple Regression Makes a Composite Variable

  • Linear combination of predictor variables that is maximally correlated with outcome variable
  • \(y = {\beta_1}X_1 + {\beta_2}X_2 + {\beta_3}X_3\)
  • \(\hat{y}\) is a composite variable

Multiple regression makes a composite variable

Milk investment across mammalian species

Principal components analysis

Goals:

  1. Create a set of composite variables that encompasses the shared variation among variables
  2. Reduce the dimensionality of a set of variables: from considering all variables separately to considering fewer composite variables, while
  3. Accounting for as much of the original variation as possible in the data set

No predictor or outcome variable. Think of it as a one sided equation.

Principal components

  • Sequential linear combinations of the original variables
  • The resulting composite variables are uncorrelated with one another
  • The first few usually account for most of the variation in the original data

Principal components

For q variables, you get q PCs

  • First encompasses the maximum variance
  • Each in turn maximizes remaining variance while remaining uncorrelated

Milk investment

Milk investment

Find the indices of the minimum and maximum of x1

themin <- which.min(M$Fat)
themax <- which.max(M$Fat)

Make a data.frame with the points for plotting

minpt <- data.frame(x1 = M$Fat[themin], x2 = M$Lactose[themin])
maxpt <- data.frame(x1 = M$Fat[themax], x2 = M$Lactose[themax])

Milk investment

Green = Low Fat, High Lactose

Blue = High Fat, Low Lactose

Major axis == PC1

Green = Low Fat, High Lactose

Blue = High Fat, Low Lactose

Minor axis == PC2

Green = Low Fat, High Lactose

Blue = High Fat, Low Lactose

Plotting PCs

  • Rotation of the axes of multivariate data

Green = Low Fat, High Lactose

Blue = High Fat, Low Lactose

Comparing PCs

Green = Low Fat, High Lactose

Blue = High Fat, Low Lactose

Principal components

prcomp() is the preferred function for PCA (not princomp()):

z <- prcomp(~ Fat + Lactose, data = M, center = TRUE, scale. = TRUE)
  • One sided formula: ~ Fat + Lactose
  • Centered to a mean of 0
  • Scaled to standard deviation of 1

prcomp objects

str(z)
## List of 6
##  $ sdev    : num [1:2] 1.393 0.242
##  $ rotation: num [1:2, 1:2] 0.707 -0.707 -0.707 -0.707
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2] "Fat" "Lactose"
##   .. ..$ : chr [1:2] "PC1" "PC2"
##  $ center  : Named num [1:2] 34 49.6
##   ..- attr(*, "names")= chr [1:2] "Fat" "Lactose"
##  $ scale   : Named num [1:2] 14.3 14.1
##   ..- attr(*, "names")= chr [1:2] "Fat" "Lactose"
##  $ x       : num [1:29, 1:2] -1.785 -1.444 -1.962 -2.066 -0.514 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:29] "1" "2" "3" "4" ...
##   .. ..$ : chr [1:2] "PC1" "PC2"
##  $ call    : language prcomp(formula = ~Fat + Lactose, data = M, center = TRUE, scale. = TRUE)
##  - attr(*, "class")= chr "prcomp"

x is a matrix of the the principal components

Proportion of variance (eigenvalues)

summary(z)
## Importance of components:
##                           PC1     PC2
## Standard deviation     1.3934 0.24158
## Proportion of Variance 0.9708 0.02918
## Cumulative Proportion  0.9708 1.00000

PC1 explains 97% of the variance in the data.

Loadings (eigenvectors)

Correlations with with composite variable

print(z)
## Standard deviations (1, .., p=2):
## [1] 1.3934265 0.2415836
## 
## Rotation (n x k) = (2 x 2):
##                PC1        PC2
## Fat      0.7071068 -0.7071068
## Lactose -0.7071068 -0.7071068
  • Lactose loads negatively on PC1 and PC2
  • Fat loads positively on PC1 and negatively on PC2
  • Magnitudes are more informative with more than 2 variables

Extracting PC scores

PC <- data.frame(pc1 = z$x[, 1],
                 pc2 = z$x[, 2])
PC
##            pc1          pc2
## 1  -1.78509431 -0.063653225
## 2  -1.44365796  0.013484299
## 3  -1.96166269  0.006259316
## 4  -2.06645523 -0.177723875
## 5  -0.51393051  0.150315276
## 6  -0.91347766  0.350637364
## 7  -0.07717306  0.351480511
## 8   1.90779088 -0.014528235
## 9   1.55556364  0.358828937
## 10  1.95386703  0.311684874
## 11  0.95822267  0.229702746
## 12 -2.55254845  0.423074692
## 13  0.62357110  0.189046494
## 14  0.53522693  0.247205046
## 15  0.42873297 -0.083992416
## 16  0.69910152  0.089367582
## 17  1.73099540 -0.155687913
## 18  1.98506466 -0.145129935
## 19  1.36259231 -0.113289467
## 20  1.75461984 -0.256788764
## 21 -1.99486571 -0.210072095
## 22 -0.14099491 -0.192434707
## 23 -0.90363007 -0.154254327
## 24 -0.07624607 -0.451377680
## 25 -0.68540391 -0.011259289
## 26 -0.11968797  0.216906468
## 27 -1.51615118 -0.248076624
## 28 -0.06089007 -0.342972655
## 29  1.31652081 -0.316752397

Extracting PC scores

Green = Low Fat, High Lactose

Blue = High Fat, Low Lactose

##         pc1         pc2
## 1 -1.785094 -0.06365323

Milk investment: PCA approach

  1. Do fat and lactose together predict milk energy?
PC1 <- z$x[, 1]
summary(lm(Milk_Energy ~ PC1, data = M))
## 
## Call:
## lm(formula = Milk_Energy ~ PC1, data = M)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.10569 -0.07164  0.01139  0.04552  0.11731 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.641724   0.012138   52.87  < 2e-16 ***
## PC1         0.106277   0.008865   11.99 2.54e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06537 on 27 degrees of freedom
## Multiple R-squared:  0.8418, Adjusted R-squared:  0.836 
## F-statistic: 143.7 on 1 and 27 DF,  p-value: 2.539e-12

Milk investment: Multiple Regression

  1. Do fat and lactose together predict milk energy?
fm_Multi <- lm(Milk_Energy ~ Fat + Lactose, data = M)
summary(fm_Multi)
## 
## Call:
## lm(formula = Milk_Energy ~ Fat + Lactose, data = M)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.11350 -0.05047  0.01103  0.04649  0.12701 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.007372   0.211168   4.770 6.16e-05 ***
## Fat          0.001952   0.002533   0.771  0.44784    
## Lactose     -0.008709   0.002575  -3.382  0.00229 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06447 on 26 degrees of freedom
## Multiple R-squared:  0.8518, Adjusted R-squared:  0.8404 
## F-statistic: 74.74 on 2 and 26 DF,  p-value: 1.657e-11

Mammal life history

M <- read_excel("../data/mammals.xlsx", na = "NA") %>%
  dplyr::select(litter_size,
                adult_body_mass_g,
                neonate_body_mass_g,
                max_longevity_m,
                sexual_maturity_age_d) %>% 
  rename(Litter_Size = litter_size,
         Adult_Mass = adult_body_mass_g,
         Neonate_Mass = neonate_body_mass_g,
         Longevity = max_longevity_m,
         Maturity_Age = sexual_maturity_age_d) %>% 
  drop_na()

Mammal life history

~ . means all columns

z <- prcomp(~ .,
            data = M,
            center = TRUE,
            scale. = TRUE)

Centering and \(Z\)-scaling are important.

  • Should be the default but are not.

Use factoextra for more handy functions

remotes::install_github("kassambara/factoextra")
library(factoextra)

factoextra has several convenience functions for working with PCA.

Mammal life history

get_eig(z)
##       eigenvalue variance.percent cumulative.variance.percent
## Dim.1  2.6828230        53.656461                    53.65646
## Dim.2  1.4775239        29.550478                    83.20694
## Dim.3  0.6218664        12.437328                    95.64427
## Dim.4  0.1541320         3.082640                    98.72691
## Dim.5  0.0636547         1.273094                   100.00000

Eigenvalues are variances. Default R print() method returns standard deviations.

Mammal life history

print(z)
## Standard deviations (1, .., p=5):
## [1] 1.6379325 1.2155344 0.7885851 0.3925965 0.2522988
## 
## Rotation (n x k) = (5 x 5):
##                     PC1        PC2        PC3         PC4         PC5
## Litter_Size   0.3029675 -0.4765735  0.8184362  0.09980480  0.03591135
## Adult_Mass   -0.4462968 -0.5301452 -0.1399632 -0.26489329  0.65574932
## Neonate_Mass -0.4761971 -0.4894702 -0.0803547  0.02829269 -0.72553281
## Longevity    -0.5400521  0.2503846  0.2471013  0.74134502  0.18708228
## Maturity_Age -0.4365890  0.4353737  0.4930076 -0.60784672 -0.08547258

Mammal life history

fviz_eig(z, addlabels = TRUE)

Mammal life history

fviz_pca_var(z)

Mammal life history

fviz_pca_var(z, axes = c(2, 3))

Sample size and other concerns

Suggested sample sizes vary:

  • n = over 50
  • n:q > 5:1 (often violated with genomic data)
  • n and n:q are both large

All data:

  • Numeric (continuous)
  • No missing values

Best case

  • A small number of variables that can be used as surrogates for the larger set without too much loss of information
  • Lower dimensional summary of a larger set of variables

PC1:

  • Variables that combined account for the most variance
  • For morphological data:
    • Can often be a proxy for “size”
    • Remaining PCs are “shape”

Drawbacks

  1. Lose the original variable identity
    • Interpretation can be a challenge
  2. If centered and scaled (advised), you lose the original scale